Abstract

The organizer of a machine learning competition faces the problem of maintaining 
an accurate leaderboard that faithfully represents the quality of the best submission of 
each competing team. What makes this estimation problem particularly challenging is its 
sequential and adaptive nature. As participants are allowed to repeatedly evaluate their 
submissions on the leaderboard, they may begin to overfit to the holdout data that supports 
the leaderboard. Few theoretical results give actionable advice on how to design a reliable 
leaderboard. Existing approaches therefore often resort to poorly understood heuristics such 
as limiting the bit precision of answers and the rate of re-submission. 

In this work, we introduce a notion of leaderboard accuracy tailored to the format of a 
competition. We introduce a natural algorithm called the Ladder and demonstrate that it simultaneously
supports strong theoretical guarantees in a fully adaptive model of estimation,
withstands practical adversarial attacks, and achieves high utility on real submission files 
from an actual competition hosted by Kaggle. 

Notably, we are able to sidestep a powerful recent hardness result for adaptive risk 
estimation that rules out algorithms such as ours under a seemingly very similar notion 
of accuracy. On a practical note, we provide a completely parameter-free variant of our 
algorithm that can be deployed in a real competition with no tuning required whatsoever. 


1 Introduction 

Machine learning competitions have become an extremely popular format for solving prediction 
and classification problems of all kinds. A number of companies such as Netflix have organized 
major competitions in the past and some start-ups like Kaggle specialize in hosting machine 
learning competitions. In a typical competition hundreds of participants will compete for 
prize money by repeatedly submitting classifiers to the host in an attempt to improve on their 
previously best score. The score reflects the performance of the classifier on some subset of the 
data, which are typically partitioned into two sets: a training set and a test set. The training set 
is publicly available with both the individual instances and their corresponding class labels. The 
test set is publicly available as well, but the class labels are withheld. Predicting these missing 
class labels is the goal of the participant and a valid submission is simply a list of labels—one 
for each point in the test set. 

The central component of any competition is the leaderboard which ranks all teams in the 
competition by the score of their best submission. This leads to the fundamental problem of 
maintaining a leaderboard that accurately reflects the true strength of a classifier. What makes 
this problem so challenging is that participants may begin to incorporate the feedback from the 
leaderboard into the design of their classifier, thus creating a dependence between the classifier
and the data on which it is evaluated. In such cases, it is well known that the holdout set 
no longer gives an unbiased estimate of the classifier's true performance. To counteract this 
problem, existing solutions such as the one used by Kaggle further partition the test set into two 
parts. One part of the test set is used for computing scores on the public leaderboard. The other 
is used to rank all submissions after the competition has ended. This final ranking is often referred
to as the private leaderboard. While this solution increases the quality of the private leaderboard, 
it does not address the problem of maintaining accuracy on the public leaderboard. Indeed, 
numerous posts on the forums of Kaggle report on the problem of "overfitting to the holdout" 
meaning that some scores on the public leaderboard are inflated compared to final scores. To 
mitigate this problem Kaggle primarily restricts the rate of re-submission and to some extent 
the numerical precision of the released scores. 

Yet, in spite of its obvious importance, there is relatively little theory on how to design a 
leaderboard with rigorous quality guarantees. Basic questions remain difficult to assess, such as, 
can we a priori quantify how accurate existing leaderboard mechanisms are and can we design 
better methods? 

While the theory of estimating the true loss of a classifier or set of classifiers from a finite 
sample is decades old, much of this theory breaks down due to the sequential and adaptive nature
of the estimation problem that arises when maintaining a leaderboard. First of all, there is no 
a priori understanding of which learning algorithms are going to be used, the complexity of 
the classifiers they are producing, and how many submissions there are going to be. Indeed, 
submissions are just a list of labels and do not even specify how these labels were obtained. 
Second, any submission might incorporate statistical information about the withheld class labels 
that was revealed by the score of previous submissions. In such cases, the public leaderboard 
may no longer provide an unbiased estimate of the true score. To make matters worse, very 
recent results suggest that maintaining accurate estimates on a sequence of many adaptively 
chosen classifiers may be computationally intractable [HU, SU]. 

1.1 Our Contributions 

We introduce a notion of accuracy called leaderboard accuracy tailored to the format of a competition.
Intuitively, high leaderboard accuracy entails that each score represented on the
leaderboard is close to the true score of the corresponding classifier on the unknown distribution
from which the data were drawn. Our primary theoretical contributions are the following.

1. We show that there is a simple and natural algorithm we call Ladder that achieves high 
leaderboard accuracy in a fully adaptive model of estimation in which we place no 
restrictions on the data analyst whatsoever. In fact, we don't even limit the number of 
submissions an analyst can make. Formally, our worst-case upper bound shows that if we
normalize scores to be in [0,1] the maximum error of our algorithm on any estimate is
never worse than $O((\log(k)/n)^{1/3})$ where k is the number of submissions and n is the size
of the data used to compute the leaderboard. In contrast, we observe that the error of the
Kaggle mechanism (and similar solutions) scales with the number of submissions as $\sqrt{k}$ so
that our algorithm features an exponential improvement in k.

2. We also prove an information-theoretic lower bound on the leaderboard accuracy demonstrating
that no estimator can achieve error smaller than $\Omega((\log(k)/n)^{1/2})$.

Complementing our theoretical worst-case upper bound and lower bound, we make a
number of practical contributions: 

1. We provide a parameter-free variant of our algorithm that can be deployed in a real 
competition with no tuning required whatsoever. 

2. To demonstrate the strength of our parameter-free algorithm we conduct two opposing 
experiments. The first is an adversarial—yet practical—attack on the leaderboard that 
aims to create as much of a bias as possible with a given number of submissions. We 
compare the performance of the Kaggle mechanism to that of the Ladder mechanism 
under this attack. We observe that the accuracy of the Kaggle mechanism diminishes 
rapidly with the number of submissions, while our algorithm encounters only a small bias 
in its estimates. 

3. In a second experiment, we evaluate our algorithm on real submission files from a Kaggle 
competition. The data set presents a difficult benchmark as little overfitting occurred 
and the errors of the Kaggle leaderboard were generally within the expected statistical 
deviations given the properties of the data set. Even on this benchmark our algorithm 
produced a leaderboard that is very close to that computed by Kaggle. Through a sequence 
of significance tests we assess that the differences between the two leaderboards on this 
competition are not statistically significant. 

In summary, our algorithm supports strong theoretical results while suggesting a simple and 
practical solution. Importantly, it is one and the same parameter-free algorithm that withstands 
our adversarial attack and simultaneously achieves high utility in a real Kaggle competition. 

An important aspect of our algorithm is that it only releases a score to the participant if the 
score presents a statistically significant improvement over the previously best submission of the 
participant. Intuitively, this prevents the participant from exploiting or overfitting to minor 
fluctuations in the observed score values. 

1.2 Related Work 

There is a vast literature on preventing overfitting in the context of model assessment and 
selection. See, for example, Chapter 7 of [HTF] for background. Two particularly popular
practical approaches are various forms of cross-validation and bootstrapping. It is important to 
note though that when scoring a submission for the leaderboard, neither of these techniques 
applies. One problem is that participants submit only a list of labels and not the corresponding 
learning algorithms. In particular, the organizer of the competition has no means of retraining 
the model on a different split of the data. Similarly, the natural bootstrap estimate of the 
expected loss of a classifier given a finite sample is simply the empirical average of the loss on 
the finite sample, which is what existing solutions release anyway. The other substantial obstacle 
is that even if these methods applied, their theoretical guarantees in the adaptive setting of 
estimation are largely not understood. 

A highly relevant recent work [DFH+], that inspired us, studies a more general question:
Given a sequence of adaptively chosen bounded functions $f_1,\dots,f_k\colon X \to \{0,1\}$ over a domain X,
estimate the expectations of these functions $\mathbb{E} f_1,\dots,\mathbb{E} f_k$ over an unknown distribution $\mathcal{D}$, given
n samples from this distribution. If we think of each function as expressing the loss of one
classifier submitted to the leaderboard, then such an algorithm could in principle be used in
our setting. The main result of [DFH+] is an algorithm that achieves maximum error

$$O\left(\min\left\{\log(k)^{3/7}(\log|X|)^{1/7}/n^{2/7},\ (\log|X|\log(k)/n)^{1/4}\right\}\right).$$

This bound readily implies a corresponding result for leaderboard accuracy albeit worse than the 
one we show. One issue is that this algorithm requires the entire test set to be withheld and not 
just the labels as is required in the Kaggle application. The bigger obstacle is that the algorithm 
is unfortunately not computationally efficient and this is inherent. In fact, no computationally 
efficient algorithm can give non-trivial error on $k > n^{2+o(1)}$ adaptively chosen functions as was
shown recently [HU, SU] under a standard computational hardness assumption.

Matching this hardness result, there is a computationally efficient algorithm in [DFH+] whose
error bound implies a bound on leaderboard accuracy that is worse than ours for all k beyond
a certain threshold. They also give an algorithm (called EffectiveRounds) with
accuracy $O(\sqrt{r\log(k)/n})$ when the number of "rounds of adaptivity" is at most r. While we do
not have a bound on r in our setting better than k,¹ the proof technique relies on sample splitting
and a similar argument could be used to prove our upper bound. However, our argument does 
not require sample splitting and this is very important for the practical applicability of the 
algorithm. 

We sidestep the hardness result by going to a more specialized notion of accuracy that is 
surprisingly still sufficient for the leaderboard application. However, it does not resolve the 
more general question raised in [DFH+]. In particular, we do not always provide a loss estimate 
for each submitted classifier, but only for those that made a significant improvement over the 
previous best. This seemingly innocuous change is enough to circumvent the aforementioned 
hardness results. 

1.3 Preliminaries

Let X be a data domain and Y be a finite set of class labels, e.g., $X = \mathbb{R}^d$ and $Y = \{0,1\}$. Rather
than speaking of the score of a classifier we will use the term loss with the understanding that
smaller is better. A loss function is a mapping of the form $\ell\colon Y \times Y \to [0,1]$ and a classifier is a
mapping $f\colon X \to Y$. A standard loss function is the 0/1-loss defined as $\ell_{01}(y, y') = 1$ if $y \ne y'$ and
0 otherwise.

We assume that we are given a sample $S = \{(x_1,y_1),\dots,(x_n,y_n)\}$ drawn i.i.d. from an unknown
distribution $\mathcal{D}$ over $X \times Y$. We define the empirical loss of a classifier f on the sample S as

$$R_S(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i).$$

¹ The parameter r corresponds to the depth of the adaptive tree we define in the proof of Theorem 3.1. While we
bound the size of the tree, the depth could be as large as k.

The true loss is defined as

$$R_{\mathcal{D}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f(x),y)\right].$$

Throughout this paper we assume that S consists of n i.i.d. draws from $\mathcal{D}$ and $\ell$ is a loss function
with bounded range.


2 Sequential and Adaptive Loss Estimation 

In this section we formally define the adaptive model of estimation that we work in and present
our definition of leaderboard accuracy. Given a sequence of classifiers $f_1,\dots,f_k$ and a finite
sample S of size n, a fundamental estimation problem is to compute estimates $R_1,\dots,R_k$ such
that

$$\Pr\left\{\exists t \in [k]\colon |R_t - R_{\mathcal{D}}(f_t)| > \varepsilon\right\} \le \delta. \qquad (1)$$

The standard way of estimating the true loss is via the empirical loss. If we assume that all
functions $f_1,\dots,f_k$ are fixed independently of the sample S, then Hoeffding's bound and the
union bound imply

$$\Pr\left\{\exists t \in [k]\colon |R_S(f_t) - R_{\mathcal{D}}(f_t)| > \varepsilon\right\} \le 2k\exp(-2\varepsilon^2 n). \qquad (2)$$
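To get a feel for inequality (2), one can solve it for the deviation that holds with a given confidence in the non-adaptive case. The sketch below is illustrative only; the function name `union_bound_epsilon` and the example values of k, n and delta are assumptions, not part of the paper.

```python
import math


def union_bound_epsilon(k: int, n: int, delta: float) -> float:
    """Smallest eps for which 2 * k * exp(-2 * eps**2 * n) <= delta.

    This is inequality (2) solved for eps. It is only valid when the k
    classifiers are fixed independently of the sample.
    """
    if k < 1 or n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need k >= 1, n >= 1 and 0 < delta < 1")
    return math.sqrt(math.log(2 * k / delta) / (2 * n))


# Hypothetical numbers: 1000 fixed submissions scored on 4000 holdout labels.
eps = union_bound_epsilon(k=1000, n=4000, delta=0.05)
print(f"all 1000 estimates within {eps:.4f} with probability >= 0.95")
```

The deviation grows only like the square root of log(k), which is the benchmark that the adaptive setting fails to meet without further care.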

In the adaptive setting, however, we assume that the classifier $f_t$ may be chosen as a function of
the previous estimates and the previously chosen classifiers. Formally, there exists a mapping A
such that for all $t \in [k]$:

$$f_t = A(f_1,R_1,\dots,f_{t-1},R_{t-1}).$$

We will assume for simplicity that A is a deterministic algorithm. The tuple $(f_1,R_1,\dots,f_{t-1},R_{t-1})$
is nevertheless a random variable due to the random sample used to compute the estimates.

Unfortunately, in the case where the choice of $f_t$ depends on previous estimates, we may
no longer apply Hoeffding's bound to control $R_S(f_t)$. In fact, recent work [HU, SU] shows that
no computationally efficient estimator can achieve error o(1) on more than $n^{2+o(1)}$ adaptively
chosen functions (under a standard hardness assumption). Since we're primarily interested
in a computationally efficient algorithm, these hardness results demonstrate that the goal of
achieving the accuracy guarantee specified in inequality (1) is too stringent in the adaptive
setting when k is large. We will therefore introduce a weaker notion of accuracy called leaderboard
accuracy under which we can circumvent the hardness results and nevertheless achieve a
guarantee strong enough for our application.


2.1 Leaderboard Accuracy


The goal of an accurate leaderboard is to guarantee that at each step $t \le k$, the leaderboard
accurately reflects the best classifier among those classifiers $f_1,\dots,f_t$ submitted so far. In other
words, while we do not need an accurate estimate for each $f_t$, we wish to maintain that the t-th
estimate $R_t$ correctly reflects the minimum loss achieved by any classifier so far. This leads to
the following definition.

Definition 2.1. Given an adaptively chosen sequence of classifiers $f_1,\dots,f_k$, we define the
leaderboard error of estimates $R_1,\dots,R_k$ as

$$\mathrm{lberr}(R_1,\dots,R_k) = \max_{1 \le t \le k}\left|\min_{1 \le i \le t} R_{\mathcal{D}}(f_i) - R_t\right|. \qquad (3)$$
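Definition 2.1 translates directly into a running-minimum computation. The following is a minimal sketch; the function name `leaderboard_error` and the toy numbers are assumptions made for illustration, and in practice the true losses are unknown and can only be approximated on fresh data.

```python
from typing import Sequence


def leaderboard_error(true_losses: Sequence[float],
                      estimates: Sequence[float]) -> float:
    """Leaderboard error (3): the largest gap, over all steps t, between the
    best true loss among the first t classifiers and the t-th estimate."""
    if len(true_losses) != len(estimates):
        raise ValueError("need one estimate per classifier")
    worst_gap = 0.0
    best_true = float("inf")
    for true_loss, estimate in zip(true_losses, estimates):
        best_true = min(best_true, true_loss)  # min over i <= t of the true loss
        worst_gap = max(worst_gap, abs(best_true - estimate))
    return worst_gap


# Toy example: the third classifier is worse than the second, so the
# leaderboard only has to track the running minimum 0.40.
print(leaderboard_error([0.50, 0.40, 0.45], [0.50, 0.42, 0.42]))
```

Note that the estimate for the third classifier is compared against 0.40, not against its own true loss of 0.45.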

Given an algorithm that achieves high leaderboard accuracy there are two simple ways to
extend it to provide a full leaderboard:

1. Use one instance of the algorithm for each team to maintain the best score achieved by
each team.

2. Use one instance of the algorithm for each rank on the leaderboard. When a new submission
comes in, evaluate it against each instance in descending order to determine its place
on the leaderboard.

The first variant is straightforward to implement, but requires the assumption that competitors
don't use several accounts (a practice that is typically against the terms of use of a competition).
The second variant is more conservative and does not need this assumption.

3 The Ladder Mechanism 

We introduce an algorithm called the Ladder Mechanism that achieves high leaderboard
accuracy. The algorithm is very simple. For each given function, it compares the empirical loss
estimate of the function to the previously smallest loss. If the estimate is below the previous
best by some margin, it releases the estimate and updates the best estimate. Importantly, if the
estimate is not smaller by a margin, the algorithm releases the previous best loss (rather than
the new estimate). A formal description follows in Figure 1.


Input: Data set S, step size $\eta > 0$

Algorithm:

- Assign initial estimate $R_0 \leftarrow \infty$.

- For each round $t \leftarrow 1, 2, \dots$

1. Receive function $f_t\colon X \to Y$.

2. If $R_S(f_t) < R_{t-1} - \eta$, assign $R_t \leftarrow [R_S(f_t)]_\eta$. Else assign $R_t \leftarrow R_{t-1}$.

3. Output $R_t$.


Figure 1: The Ladder Mechanism. We use the notation $[x]_\eta$ to denote the number x rounded to the
nearest integer multiple of $\eta$.
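The steps of Figure 1 can be sketched in a few lines. This is an illustrative implementation under assumptions that are not in the paper: the class name `Ladder`, the use of the 0/1-loss, and submissions given as plain lists of predicted labels.

```python
import math
from typing import Sequence


class Ladder:
    """Fixed-step Ladder Mechanism of Figure 1, specialised to the 0/1-loss."""

    def __init__(self, labels: Sequence[int], eta: float) -> None:
        if eta <= 0:
            raise ValueError("step size eta must be positive")
        self.labels = list(labels)
        self.eta = eta
        self.best = math.inf  # R_0 = infinity

    def submit(self, predictions: Sequence[int]) -> float:
        """Score one submission and return the released estimate R_t."""
        if len(predictions) != len(self.labels):
            raise ValueError("submission has the wrong length")
        errors = sum(p != y for p, y in zip(predictions, self.labels))
        loss = errors / len(self.labels)  # empirical loss R_S(f_t)
        if loss < self.best - self.eta:
            # Improvement by more than eta: release the loss rounded to the
            # nearest integer multiple of eta and remember it as the new best.
            self.best = round(loss / self.eta) * self.eta
        # Otherwise the previous best is released again.
        return self.best


labels = [1] * 10 + [0] * 10
ladder = Ladder(labels, eta=0.1)
print(ladder.submit([0] * 20))             # first submission is always released
print(ladder.submit([1] + [0] * 19))       # improves by 0.05 only: old value repeated
print(ladder.submit([1] * 6 + [0] * 14))   # improves by 0.3: new value released
```

Because a submission that fails the margin test learns nothing beyond "not better by eta", the number of informative answers is bounded, which is what the analysis in this section exploits.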


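Theorem 3.1. For any sequence of adaptively chosen classifiers $f_1,\dots,f_k$, the Ladder Mechanism
satisfies for all $t \le k$ and $\varepsilon > 0$,

$$\Pr\left\{\left|\min_{1 \le i \le t} R_{\mathcal{D}}(f_i) - R_t\right| > \varepsilon + \eta\right\} \le \exp\left(-2\varepsilon^2 n + (1/\eta + 2)\log(4t/\eta) + 1\right). \qquad (4)$$

In particular, for some $\eta = O(n^{-1/3}\log^{1/3}(kn))$, the Ladder Mechanism achieves with high probability,

$$\mathrm{lberr}(R_1,\dots,R_k) \le O\left(\frac{\log^{1/3}(kn)}{n^{1/3}}\right).$$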

Proof. Let A be the adaptive analyst generating the function sequence. Fix $t \le k$. The algorithm A
naturally defines a rooted tree T of depth t recursively defined as follows:

1. The root is labeled by $f_1 = A(\emptyset)$.

2. Each node at depth $1 < i \le t$ corresponds to one realization $(h_1,r_1,\dots,h_{i-1},r_{i-1})$ of the random
variable $(f_1,R_1,\dots,f_{i-1},R_{i-1})$ and is labeled by $h_i = A(h_1,r_1,\dots,h_{i-1},r_{i-1})$. Its children
are defined by each possible value of the output $R_i$ of the Ladder Mechanism on the sequence
$h_1,r_1,\dots,r_{i-1},h_i$.

Claim 3.2. Let $B = (1/\eta + 2)\log(4t/\eta)$. Then, $|T| \le 2^B$.

Proof. To prove the claim, we will uniquely encode each node in the tree using B bits of
information. The claim then follows directly. The compression argument is as follows. We use
$\lceil\log(t)\rceil \le \log(2t)$ bits to specify the depth of the node in the tree. We then specify the index
of each $1 \le i \le t$ for which $R_i < R_{i-1} - \eta$ together with the value $R_i$. Note that since $R_i \in [0,1]$
there can be at most $\lceil 1/\eta\rceil \le (1/\eta) + 1$ many such steps. Moreover, there are at most $\lceil 1/\eta\rceil$
many possible values for $R_i = [R_S(f_i)]_\eta$. Hence, specifying all such indices requires at most
$(1/\eta + 1)(\log(2/\eta) + \log(2t))$ bits. It is easy to see that this uniquely identifies each node in the graph,
since for every index i not explicitly listed we know that $R_i = R_{i-1}$. The total number of bits we
used is:

$$(1/\eta + 1)(\log(2/\eta) + \log(2t)) + \log(2t) \le (1/\eta + 2)\log(4t/\eta) = B.$$


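The theorem now follows by applying a union bound over all nodes in T and using Hoeffding's
inequality for each fixed node. Let F be the set of all functions appearing in T.

$$\Pr\left\{\exists f \in F\colon |R_{\mathcal{D}}(f) - R_S(f)| > \varepsilon\right\} \le 2|F|\exp(-2\varepsilon^2 n) \le 2^{B+1}\exp(-2\varepsilon^2 n) \le 2\exp(-2\varepsilon^2 n + B).$$

In particular,

$$\Pr\left\{\left|\min_{1 \le i \le t} R_{\mathcal{D}}(f_i) - \min_{1 \le i \le t} R_S(f_i)\right| > \varepsilon\right\} \le 2\exp(-2\varepsilon^2 n + B).$$

Moreover, it is clear that conditioned on the event that

$$\left|\min_{1 \le i \le t} R_{\mathcal{D}}(f_i) - \min_{1 \le i \le t} R_S(f_i)\right| \le \varepsilon,$$

at step $i^*$ where the minimum of $R_{\mathcal{D}}(f_i)$ is attained, the Ladder Mechanism must output an
estimate $R_{i^*}$ which is within $\varepsilon + \eta$ of $R_{\mathcal{D}}(f_{i^*})$. This concludes the proof. ■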
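
The choice of step size in the second part of Theorem 3.1 can be motivated by a rough calculation from inequality (4). The derivation below is only a sketch: the constant C is unspecified, logarithm bases are ignored, and the target failure probability is assumed to be at worst polynomially small in kn.

```latex
% Sketch only: constants are not optimized and logarithm bases are ignored.
\begin{align*}
\text{Set } \varepsilon = \eta \le 1 \text{ in (4), using } 1/\eta + 2 \le 3/\eta \text{ and } t \le k:\quad
  & \Pr\Bigl\{\Bigl|\min_{1 \le i \le t} R_{\mathcal{D}}(f_i) - R_t\Bigr| > 2\eta\Bigr\}
    \le \exp\Bigl(-2\eta^2 n + \tfrac{3}{\eta}\log\tfrac{4k}{\eta} + 1\Bigr), \\
\text{union bound over } t \in [k]:\quad
  & \Pr\bigl\{\mathrm{lberr}(R_1,\dots,R_k) > 2\eta\bigr\}
    \le \exp\Bigl(-2\eta^2 n + \tfrac{3}{\eta}\log\tfrac{4k}{\eta} + 1 + \ln k\Bigr), \\
\text{so for } \eta \ge n^{-1/3} \text{ it suffices that}\quad
  & \eta^3 n \ge C\log(kn),
    \quad\text{i.e.}\quad \eta \ge \bigl(C\log(kn)/n\bigr)^{1/3}.
\end{align*}
```

The positive terms in the exponent scale like log(kn)/η while the negative term scales like η²n, and balancing the two gives the cube-root rate stated in the theorem.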


3.1 A lower bound on leaderboard accuracy 

We next show that $\Omega(\sqrt{\log(k)/n})$ is a lower bound on the best possible leaderboard accuracy that
we might hope to achieve. This is true even if the functions are not adaptively chosen but fixed 
ahead of time. 


Theorem 3.3. There are classifiers $f_1,\dots,f_k$ and a bounded loss function for which we have the
minimax lower bound

$$\inf_{R}\sup_{\mathcal{D}} \mathbb{E}\left[\mathrm{lberr}(R(x_1,\dots,x_n))\right] \ge \Omega\left(\sqrt{\frac{\log(k)}{n}}\right).$$

Here the infimum is taken over all estimators $R\colon X^n \to [0,1]^k$ that take n samples from a distribution
$\mathcal{D}$ and produce k estimates $R_1,\dots,R_k = R(x_1,\dots,x_n)$. The expectation is taken over n samples from $\mathcal{D}$.

Proof. We will reduce the problem of mean estimation in a certain high-dimensional distribution
family to that of obtaining small leaderboard error. Our lower bound then follows from lower
bounds for the corresponding mean estimation problem.

Let $X = \{0,1\}^k$ and take the functions $f_1,\dots,f_k$ to be the k coordinate projections $f_i(x) = x_i$, for
$1 \le i \le k$. Let the loss function be the projection onto its first argument $\ell(y',y) = y'$ so that
$\ell(f_i(x),y) = x_i$. Consider the family of distributions $\mathbf{D}_\varepsilon = \{D_1,\dots,D_k\}$ where $D_i$ is uniform over
$\{0,1\}^k$ except that the i-th coordinate satisfies $\mathbb{E}_{x\sim D_i}\, x_i = 1/2 - \varepsilon$ for some $\varepsilon \in (0,1/4)$ that we will
determine later. Now, we have

$$R_{D_i}(f_j) = \begin{cases} 1/2 - \varepsilon & \text{if } i = j, \\ 1/2 & \text{o.w.} \end{cases}$$

Denote the mean of a distribution D by $\theta(D) = \mathbb{E}_{x\sim D}[x]$ and note that $\theta(D) = (R_D(f_1),\dots,R_D(f_k))$.

We claim that obtaining small leaderboard error on $f_1,\dots,f_k$ is at least as hard as estimating the
means of an unknown distribution in $\mathbf{D}_\varepsilon$. Formally,

$$\inf_{R}\sup_{D \in \mathbf{D}_\varepsilon} \mathbb{E}\left[\mathrm{lberr}(R(x_1,\dots,x_n))\right] \ge \frac{1}{3}\inf_{\hat\theta}\sup_{D \in \mathbf{D}_\varepsilon} \mathbb{E}\left[\|\hat\theta(x_1,\dots,x_n) - \theta(D)\|_\infty\right]. \qquad (5)$$

Indeed, let R be the estimator that achieves minimax leaderboard accuracy. Define the estimator
$\hat\theta$ as follows:

1. Given $x_1,\dots,x_n$ compute $R_1,\dots,R_k = R(x_1,\dots,x_n)$.

2. Let i be the first coordinate in the sequence $R_1,\dots,R_k$ which is less than $1/2 - \varepsilon/2$.

3. Output the vector $\hat\theta(x_1,\dots,x_n)$ which is $1/2 - \varepsilon$ in the i-th coordinate and 1/2 everywhere
else.

Note that the $\ell_\infty$-error of $\hat\theta$ is always at most $\varepsilon$, since all mean parameters in the family $\mathbf{D}_\varepsilon$
are $\varepsilon$-close in $\ell_\infty$-norm. Suppose then that $\mathrm{lberr}(R(x_1,\dots,x_n)) < \varepsilon/3$ and suppose that $D_i$ is the
unknown distribution for some $i \in [k]$. In this case, we claim that $\|\hat\theta(x_1,\dots,x_n) - \theta(D_i)\|_\infty = 0$.
Indeed, the first coordinate for which $R(x_1,\dots,x_n)$ is less than $1/2 - \varepsilon/2$ must be i. This follows
from the definition of leaderboard error and the assumption that R had error less than $\varepsilon/3$ on $x_1,\dots,x_n$.
This establishes inequality (5).

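Finally, it is well known and follows from Fano's inequality that for some $\varepsilon = \Omega(\sqrt{\log(k)/n})$,

$$\inf_{\hat\theta}\sup_{D \in \mathbf{D}_\varepsilon} \mathbb{E}\left[\|\hat\theta(x_1,\dots,x_n) - \theta(D)\|_\infty\right] \ge \Omega(\varepsilon).$$

For completeness we include the argument. Let L be a random index in [k] and assume that
$X^n$ is a random sample from $D_i$ conditional on L = i. Note that the set $P = \{\theta(D_i)\}_{i \in [k]}$ forms an
$(\varepsilon/2)$-packing in the $\ell_\infty$-norm. Hence, by Fano's inequality (see e.g. [Has, Tsy]),

$$\inf_{\hat\theta}\sup_{D \in \mathbf{D}_\varepsilon} \mathbb{E}\left[\|\hat\theta(x_1,\dots,x_n) - \theta(D)\|_\infty\right] \ge \frac{\varepsilon}{2}\left(1 - \frac{I(L;X^n) + \log 2}{\log|P|}\right),$$

where $I(L;X^n)$ is the mutual information between L and $X^n$. Moreover, it is known that

$$I(L;X^n) \le \frac{1}{k^2}\sum_{i,j \in [k]} \mathrm{KL}(D_i^n \,\|\, D_j^n) \le O(\varepsilon^2 n).$$

In the second inequality we used that the Kullback-Leibler divergence between a Bernoulli
random variable with bias 1/2 and another one with bias $1/2 - \varepsilon$ is at most $O(\varepsilon^2)$ for all $0 < \varepsilon < 1/4$.
Moreover, the Kullback-Leibler divergence of n independent samples is at most n times the
divergence of a single sample. We conclude that

$$\inf_{\hat\theta}\sup_{D \in \mathbf{D}_\varepsilon} \mathbb{E}\left[\|\hat\theta(x_1,\dots,x_n) - \theta(D)\|_\infty\right] \ge \frac{\varepsilon}{2}\left(1 - \frac{O(\varepsilon^2 n) + \log 2}{\log k}\right).$$

Setting $\varepsilon = c\sqrt{\log(k)/n}$ for small enough constant c > 0 completes the proof. ■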


4 A parameter-free Ladder mechanism 


When applying the Ladder Mechanism in practice it can be difficult to choose a fixed step 
size rj ahead of time that will work throughout an entire competition. We therefore now give a 
completely parameter-free version of our algorithm that we will use in our experiments. The 
algorithm adaptively finds a suitable step size based on previous submissions to the algorithm. 
The idea is to perform a statistical significance test to judge whether the given submission 
improves upon the previous one. The test is such that as the best classifier gets increasingly 
accurate, the step size shrinks accordingly. 

The empirical loss of a classifier is the average of n bounded numbers and follows a very 
accurate normal approximation for sufficiently large n so long as the loss is not biased too 
much towards 0. In our setting, the typical loss is bounded away from 0 so that the normal
approximation is reasonable. In order to test whether the empirical loss of one classifier is
significantly below the empirical loss of another classifier, it is appropriate to perform a one-sided
paired t-test. A paired test has substantially more statistical power in settings where the
loss vectors that are being compared are highly correlated as is common in a competition.

To recall the definition of the test, we denote the sample standard deviation of an n-dimensional
vector u as $\mathrm{std}(u) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(u_i - \mathrm{mean}(u))^2}$, where mean(u) denotes the average of the
entries in u. With this notation, the paired t-test statistic given two vectors u and v is defined as

$$t = \sqrt{n}\cdot\frac{\mathrm{mean}(u - v)}{\mathrm{std}(u - v)}. \qquad (6)$$
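
The statistic in (6) can be computed with the standard library alone. The sketch below assumes that std uses the n-1 denominator, as `statistics.stdev` does; the function name `paired_t_statistic` and the toy loss vectors are illustrative assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(u: Sequence[float], v: Sequence[float]) -> float:
    """Paired t-test statistic (6): sqrt(n) * mean(u - v) / std(u - v)."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    diff = [a - b for a, b in zip(u, v)]
    s = stdev(diff)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("statistic is undefined when all differences are equal")
    return math.sqrt(len(diff)) * mean(diff) / s


# Toy per-example 0/1 losses of two classifiers on the same four points.
u = [1, 0, 1, 1]
v = [0, 0, 1, 0]
print(paired_t_statistic(u, v))  # positive: u has the larger average loss
```

Only the coordinates where the two loss vectors differ contribute to the statistic, which is why the paired test is sensitive when submissions are highly correlated.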


Keeping this definition in mind, our parameter-free Ladder mechanism in Figure 2 is now very 
natural. On top of the loss estimate, it also maintains the loss vector of the previously best 
classifier (starting with the trivial all zeros loss vector). 

The algorithm in Figure 2 releases the estimate of $R_S(f_t)$ up to an error of 1/n which is
significantly below the typical step size of $\Omega(1/\sqrt{n})$. Looking back at our analysis, this is not a
problem since such an estimate only reveals log(n) bits of information which is the same up to
constant factors as an estimate that is accurate to within $1/\sqrt{n}$. The more critical quantity is the
step size as it controls how often the algorithm releases a new estimate. 

In the following sections we will show that the parameter-free Ladder mechanism achieves 
high accuracy both under a strong attack as well as on a real Kaggle competition. 


4.1 Remark on the interpretation of the significance test 

Input: Data set $S = \{(x_1,y_1),\dots,(x_n,y_n)\}$ of size n

Algorithm:

- Assign initial estimate $R_0 \leftarrow \infty$, and loss vector $l_0 = (0)_{i=1}^{n}$.

- For each round $t \leftarrow 1, 2, \dots, k$:

1. Receive function $f_t\colon X \to Y$.

2. Compute loss vector $l_t \leftarrow (\ell(f_t(x_i), y_i))_{i=1}^{n}$.

3. Compute the sample standard deviation $s \leftarrow \mathrm{std}(l_t - l_{t-1})$.

4. If $R_S(f_t) < R_{t-1} - s/\sqrt{n}$:

(a) $R_t \leftarrow [R_S(f_t)]_{1/n}$.

5. Else assign $R_t \leftarrow R_{t-1}$ and $l_t \leftarrow l_{t-1}$.

6. Output $R_t$.


Figure 2: The parameter-free Ladder Mechanism. We use the notation $[x]_\eta$ to denote the number x
rounded to the nearest integer multiple of $\eta$.

For sufficiently large n, the test statistic on the left hand side of (6) is well approximated by a
Student's t-distribution with n-1 degrees of freedom. The test performed in our algorithm at
each step corresponds to refuting the null hypothesis roughly at the 0.15 significance level.

It is important to note, however, that our use of this significance test is primarily heuristic. 
This is because for t > 1, due to the adaptive choices of the analyst, the function ft may in 
general not be independent of the sample S. In such a case, the Student approximation is no 
longer valid. Besides, we apply the test many times, but do not control for multiple comparisons.
Nevertheless, the significance test is an intuitive guide for deciding which improvements are 
statistically significant. 
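
The parameter-free mechanism of Figure 2 can be sketched as follows. This is an illustrative implementation rather than the authors' code: the class name `ParameterFreeLadder`, the restriction to the 0/1-loss, and the list-based submission format are assumptions.

```python
import math
from statistics import stdev
from typing import Sequence


class ParameterFreeLadder:
    """Parameter-free Ladder Mechanism of Figure 2 for the 0/1-loss."""

    def __init__(self, labels: Sequence[int]) -> None:
        if len(labels) < 2:
            raise ValueError("need at least two labels")
        self.labels = list(labels)
        self.n = len(self.labels)
        self.best = math.inf                # R_0 = infinity
        self.best_losses = [0] * self.n     # l_0 = all-zeros loss vector

    def submit(self, predictions: Sequence[int]) -> float:
        """Score one submission and return the released estimate R_t."""
        if len(predictions) != self.n:
            raise ValueError("submission has the wrong length")
        losses = [int(p != y) for p, y in zip(predictions, self.labels)]
        empirical = sum(losses) / self.n
        # Adaptive step size: standard deviation of the paired differences
        # against the loss vector of the previous best submission.
        s = stdev([a - b for a, b in zip(losses, self.best_losses)])
        if empirical < self.best - s / math.sqrt(self.n):
            self.best = round(empirical * self.n) / self.n  # multiple of 1/n
            self.best_losses = losses
        return self.best


labels = [1] * 10 + [0] * 10
board = ParameterFreeLadder(labels)
print(board.submit([0] * 20))               # first submission is released
print(board.submit([1] + [0] * 18 + [1]))   # same loss as before: no update
print(board.submit([1] * 6 + [0] * 14))     # clear improvement: released
```

The threshold test is equivalent to requiring the paired t-statistic between the new and the previous best loss vector to fall below -1, which is where the significance level of roughly 0.15 comes from.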

5 The boosting attack 

In this section we describe a new canonical attack that an adversarial analyst might perform in 
order to boost their ranking on the public leaderboard. Besides being practical in some cases, 
the attack also serves as an analytical tool to assess the accuracy of concrete mechanisms. 

For simplicity we describe the attack only for the 0/1-loss although it generalizes to other 
reasonable functions such as the clipped logarithmic loss often used by Kaggle. We assume 
that the hidden solution is a vector $y \in \{0,1\}^n$. The analyst may submit a vector $u \in \{0,1\}^n$ and
observe (up to small enough error) the loss

$$\ell_{01}(y,u) = \frac{1}{n}\sum_{i=1}^{n}\ell_{01}(u_i,y_i).$$

The attack proceeds as follows:

1. Pick $u_1,\dots,u_k \in \{0,1\}^n$ uniformly at random.

2. Observe loss estimates $l_1,\dots,l_k \in [0,1]$.

3. Let $I = \{i\colon l_i \le 1/2\}$.

4. Output $u^* = \mathrm{maj}(\{u_i\colon i \in I\})$, where the majority function is applied coordinate-wise.

The vector y corresponds to the target set of labels used for the public leaderboard which
the analyst does not know. The vectors $u_1,\dots,u_k$ represent the labels given by a sequence of k
classifiers.
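
The four steps of the attack are easy to simulate. The sketch below assumes that the leaderboard returns the exact 0/1-loss as feedback (an idealised stand-in for a high-precision score); the function name `boosting_attack`, the seeds, and the sizes n = 200 and k = 300 are illustrative choices, and ties in the majority vote are broken towards 0.

```python
import random
from typing import List, Sequence


def boosting_attack(y: Sequence[int], k: int, seed: int = 0) -> List[int]:
    """Submit k random label vectors, keep those with observed loss <= 1/2,
    and return their coordinate-wise majority vote."""
    rng = random.Random(seed)
    n = len(y)
    selected = []
    for _ in range(k):
        u = [rng.randint(0, 1) for _ in range(n)]
        observed = sum(a != b for a, b in zip(u, y)) / n  # leaderboard feedback
        if observed <= 0.5:
            selected.append(u)
    if not selected:
        return [rng.randint(0, 1) for _ in range(n)]
    m = len(selected)
    # Coordinate-wise majority over the selected vectors (ties go to 0).
    return [1 if 2 * sum(column) > m else 0 for column in zip(*selected)]


hidden_rng = random.Random(1)
y = [hidden_rng.randint(0, 1) for _ in range(200)]   # hidden public labels
u_star = boosting_attack(y, k=300)
public_loss = sum(a != b for a, b in zip(u_star, y)) / len(y)
print(f"public loss of the boosted vector: {public_loss:.3f}")
```

Although every submitted vector is pure noise, the aggregated vector scores well below 1/2 on the labels that produced the feedback, while on any fresh labels it would still be a random guess.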

The next theorem follows from a standard "boosting argument" using properties of the
majority function and the fact that each $u_i$ for $i \in I$ has a somewhat larger than expected
correlation with y.

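Theorem 5.1. Assume that $|l_i - \ell_{01}(y,u_i)| \le n^{-1/2}$ for all $i \in [k]$. Then, the boosting attack finds a
vector $u^* \in \{0,1\}^n$ so that with probability 2/3,

$$\frac{1}{n}\sum_{i=1}^{n}\ell_{01}(u^*_i,y_i) \le \frac{1}{2} - \Omega\left(\sqrt{\frac{k}{n}}\right).$$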


The previous theorem in particular demonstrates that the Kaggle mechanism has poor 
leaderboard accuracy if it is invoked with rounding parameter $\alpha \le 1/\sqrt{n}$. The currently used
rounding parameter is $10^{-5}$ which satisfies this assumption for all $n \le 10^{10}$.

Corollary 5.2. There is a sequence of adaptively chosen classifiers $f_1,\dots,f_k$ such that if $R_i$ denotes the
minimum of the first i loss estimates returned by the Kaggle mechanism (as described in Figure 7)
with accuracy $\alpha \le 1/\sqrt{n}$ where n is the size of the data set, then with probability 2/3 the estimates
$R_1,\dots,R_k$ have leaderboard error

$$\mathrm{lberr}(R_1,\dots,R_k) \ge \Omega\left(\sqrt{\frac{k}{n}}\right).$$


5.1 Experiments with the boosting attack 

Figure 3 compares the performance of the Ladder mechanism with that of the standard Kaggle 
mechanism under the boosting attack. We chose N = 12000 as the total number of labels of 
which n = 4000 labels are used for determining the public leaderboard under either mechanism. 
Other parameter settings lead to a similar picture, but these settings correspond roughly to 
the properties of the real data set that we will analyze later. The Kaggle mechanism gives 
answers that are accurate up to a rounding error of $10^{-5}$. Note that $1/\sqrt{4000} \approx 0.0158$ so that
the rounding error is well below the critical level of $1/\sqrt{n}$. The vector y in the description of
our attack corresponds to the 4000 labels used for the public leaderboard. Since the answers 
given by Kaggle only depend on these labels, the remaining labels play no role in the attack. 
Importantly, the attack does not need to know the indices of the labels used for the public 
leaderboard within the entire vector of labels. 

The 8000 coordinates not used for the leaderboard remain unbiased random bits throughout 
the attack as no information is revealed. In particular, the final submission u* is completely 
random on those 8000 coordinates and only biased on the other 4000 coordinates used for the 
leaderboard. Therefore, once we evaluate the final submission u* on the test set consisting of 
the remaining 8000 coordinates, the resulting loss is close to its expected value of 1/2, i.e. the 
expected loss of a random 0/1-vector. What we observe, however, is that the Kaggle mechanism
gives a strongly biased estimate of the loss of u*. 

The blue line in Figure 3 displays the performance of the parameter-free version of the 
Ladder mechanism. Instead of selecting all the vectors with loss at most 1/2 we modified the 
attack to be more effective against the Ladder Mechanism. Specifically, we selected all those
vectors that successfully lowered the score compared to the previous best. As we have no
information about the correlation of the remaining vectors, there is no benefit in including them 
in the boosting step. Even with this more effective attack, the Ladder mechanism gives a result 
that is correct to within the expected maximum deviation of the score on k random vectors. The 
intuitive reason is that every time a vector lowers the best score seen so far, the probability of a 
subsequent vector crossing the new threshold drops off by a constant factor. In particular there 
cannot be more than 0(log(/c)) such steps thus creating a bias of at most 0(^J\og{k)/n) in the 
boosting step. 
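
The logarithmic growth in the number of improvement steps can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the Ladder mechanism itself: it counts how often a fresh random 0/1-vector strictly lowers the best loss seen so far, without the additional margin that the Ladder requires. For independent, identically distributed scores the expected number of such strict improvements is at most the harmonic number $H_k \approx \ln k + 0.58$. The function name `count_improvements` is ours.

```python
import math
import random


def count_improvements(k, n, rng):
    """Count how many of k random vectors strictly lower the best loss so far.

    The number of mismatches between a uniformly random 0/1-vector and any fixed
    label vector of length n is Binomial(n, 1/2), so we draw n random bits and
    count the ones instead of materializing the vectors.
    """
    best = float("inf")
    steps = 0
    for _ in range(k):
        loss = bin(rng.getrandbits(n)).count("1") / n
        if loss < best:
            best = loss
            steps += 1
    return steps, best


rng = random.Random(1)
n, k = 4000, 5000
steps, best = count_improvements(k, n, rng)
harmonic = sum(1 / i for i in range(1, k + 1))
print(f"{steps} improvements out of {k} queries; H_k = {harmonic:.2f}; best loss {best:.4f}")
```

Since only the improving vectors enter the boosting step, the number of vectors the attacker can usefully combine is of order log(k) rather than k.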


Figure 3: Performance of the parameter-free Ladder mechanism compared with the Kaggle mechanism. 
Top green line: Independent test set. Middle blue line: Ladder. Bottom red line: Kaggle. Left: Kaggle 
with large rounding parameter $1/\sqrt{n} \approx 0.0158$. Right: Kaggle with normal rounding parameter 0.00001. 
All numbers are averaged over 5 independent repetitions of the experiment. Number of labels used is 
n = 4000. 


6 Experiments on real Kaggle data 

To demonstrate the utility of the Ladder mechanism we turn to real submission data from 
Kaggle's "Photo Quality Prediction" challenge^. Here is some basic information about the 
competition. 


Number of test samples              12000 
 - used for private leaderboard      8400 
 - used for public leaderboard       3600 
Number of submissions                1830 
 - processed successfully            1785 
Number of teams                       200 


Our first experiment is to use the parameter-free Ladder mechanism in place of the Kaggle 
mechanism across all 1785 submissions and recompute both the public and the private 
leaderboard. The resulting rankings turn out to be very close to those computed by Kaggle. For 
example, Table 1 shows the only perturbations in the ranking among the top 10 submissions. 

^https://www.kaggle.com/c/PhotoQualityPrediction 

          Private    Public 
Kaggle    6  8       5  6  7 
Ladder    8  6       7  5  6 


Table 1: Perturbations in the top 10 leaderboards 


Figure 4 plots the public versus private scores of the leading 50 submissions (w.r.t. the private 
leaderboard). The diagonal line indicates an equal private and public score. The plot shows a small 
amount of underfitting between the public and private scores. That is, the losses on the public 
leaderboard generally tend to be slightly higher than on the private leaderboard. This appears 
to be due to random fluctuations in the proportion of hard examples in the public holdout set. 

To assess this possibility and gain further insight into the magnitude of statistical deviations 
of the scores, we randomly split the private holdout set into two equally sized parts and 
recompute the leaderboards on each part. We repeat the process 20 times independently and 
look at the standard deviations of the scores across these 20 repetitions. Figure 5 shows the 
results demonstrating that the statistical deviations due to random splitting are large relative to 
the difference in mean scores. In particular the amount of underfitting observed on the original 
split is within one standard deviation of the mean scores which cluster close to the diagonal line. 
We also observed that the top 50 scores are highly correlated so that across different splits the 
points are either mostly above or mostly below the diagonal line. This must be due to the fact 
that the best submissions in this competition used related classifiers that fail to predict roughly 
the same label set. 
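
The resampling procedure can be sketched as follows. This is a minimal illustration on synthetic per-sample losses, not the competition data. The function names and the uniform loss distribution are assumptions made only for the example.

```python
import random
import statistics


def split_scores(losses, rng):
    """Randomly split the samples into two equally sized parts and
    return the mean loss (the score) on each part."""
    idx = list(range(len(losses)))
    rng.shuffle(idx)
    half = len(idx) // 2
    first = statistics.fmean([losses[i] for i in idx[:half]])
    second = statistics.fmean([losses[i] for i in idx[half:]])
    return first, second


def split_fluctuation(losses, repetitions, rng):
    """Mean and standard deviation of the first-part score across independent splits."""
    scores = [split_scores(losses, rng)[0] for _ in range(repetitions)]
    return statistics.fmean(scores), statistics.stdev(scores)


rng = random.Random(2)
# Synthetic per-sample losses of one submission on 8400 holdout samples.
losses = [rng.uniform(0.0, 0.4) for _ in range(8400)]
overall = statistics.fmean(losses)
first, second = split_scores(losses, rng)
mean_score, std_score = split_fluctuation(losses, 20, rng)
print(f"overall {overall:.4f}, mean over splits {mean_score:.4f}, std {std_score:.4f}")
```

Comparing `std_score` with the gap between two scores of interest indicates whether that gap is larger than what random splitting alone would produce.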

[Figure 4 panels. Left: Top 50 Kaggle scores on private data. Right: Top 50 Ladder scores on private data.] 

Figure 4: Private versus public scores for the top 50 submissions. Left: Kaggle. Right: Ladder. 


[Figure 5 panels: Kaggle top 50 scores on fresh splits; Ladder top 50 scores on fresh splits. Horizontal axis: public score.] 

Figure 5: Fluctuations of scores across 20 independent splits of the private data. Dots represent mean 
scores. Error bars indicate a single standard deviation in each direction. 


6.1 Statistical significance analysis 

To get a better sense of the statistical significance of the difference between the scores of competing 
submissions we performed a sequence of significance tests. Specifically, we considered 
the top 10 submissions taken from the Kaggle public leaderboard and tested on the private data 
whether the true score of the top submission is significantly different from the rank r submission for 
r = 2, 3, ..., 10. A suitable test of significance is the paired t-test. The score of a submission is 
the mean of a large number of samples in the interval [0,2] and is therefore well approximated by a 
normal distribution. We chose a paired t-test rather than an unpaired t-test, because it has far 
greater statistical power in our setting. This is primarily due to the strong correlation between 
competing submissions. See Equation 6 for a definition of the test statistic. Note that the data 
that determined the selection of the top 10 classifiers is independent of the data used to perform 
the significance tests. 
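
To illustrate why pairing matters when submissions are strongly correlated, the following sketch compares a paired and an unpaired test on synthetic per-sample losses. It uses the textbook paired t statistic, which may differ in detail from the definition in Equation 6, and it computes p-values with a normal approximation, which is reasonable only because the number of samples is large. All data and function names here are hypothetical.

```python
import math
import random
import statistics


def normal_two_sided_p(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))


def paired_t_test(losses_a, losses_b):
    """Paired test: work with the per-sample differences."""
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, normal_two_sided_p(t)


def unpaired_t_test(losses_a, losses_b):
    """Unpaired (Welch) test: ignores which losses belong to the same sample."""
    se = math.sqrt(
        statistics.variance(losses_a) / len(losses_a)
        + statistics.variance(losses_b) / len(losses_b)
    )
    t = (statistics.fmean(losses_a) - statistics.fmean(losses_b)) / se
    return t, normal_two_sided_p(t)


rng = random.Random(3)
n = 8400
# Both submissions share a per-sample difficulty, so their losses are highly correlated.
hardness = [rng.uniform(0.0, 1.9) for _ in range(n)]
losses_a = [h + rng.uniform(0.0, 0.05) for h in hardness]
losses_b = [h + rng.uniform(0.0, 0.05) + 0.01 for h in hardness]

t_paired, p_paired = paired_t_test(losses_a, losses_b)
t_unpaired, p_unpaired = unpaired_t_test(losses_a, losses_b)
print(f"paired:   t = {t_paired:.1f}, p = {p_paired:.3g}")
print(f"unpaired: t = {t_unpaired:.2f}, p = {p_unpaired:.3g}")
```

In this synthetic setting the shared per-sample difficulty cancels in the differences, so the paired test detects a shift of 0.01 that the unpaired test cannot distinguish from noise.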

Figure 6 plots the resulting p-values before and after correction for multiple comparisons. 
We see that after applying a Bonferroni correction, the only submissions with a significantly 
different mean are the submissions ranked 8 and 9. 

Figure 6: Significance test for the difference between score of top submission and rank r submission. 
Left: Before multiple comparison correction. Right: After Bonferroni correction. 

These observations give further evidence that the small perturbations we saw in the top 10 
leaderboard between the Kaggle mechanism and the Ladder mechanism are below the level of 
statistical significance. 
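
For reference, the Bonferroni correction used for Figure 6 multiplies each p-value by the number of comparisons and caps the result at 1. The sketch below applies it to nine comparisons (ranks 2 through 10). The p-values are made-up numbers for illustration only and are not the values from our experiment. The significance level of 0.05 is likewise an assumption.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_ranks(p_values, first_rank=2, level=0.05):
    """Ranks whose corrected p-value falls below the chosen level."""
    corrected = bonferroni(p_values)
    return [first_rank + i for i, p in enumerate(corrected) if p < level]


# Hypothetical uncorrected p-values for ranks 2, 3, ..., 10 (not from the paper).
raw = [0.30, 0.04, 0.20, 0.002, 0.50, 0.012, 0.08, 0.06, 0.0004]
corrected = bonferroni(raw)
ranks = significant_ranks(raw)
print(ranks)
```

Note that a comparison can look significant before correction (such as the value 0.04 above) and lose significance once the nine simultaneous tests are accounted for.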

7 Conclusion 

We hope that the Ladder mechanism will be helpful in making machine learning competitions 
more reliable. Beyond the scope of machine learning competitions, it is conceivable that the 
Ladder mechanism could be useful in other domains where overfitting is currently a concern. 
For example, in the context of false discovery in the empirical sciences [Ioa, GL], one could 
imagine using the Ladder mechanism as a way of keeping track of scientific progress on 
important public data sets. 

Our algorithm can also be seen as an intuitive explanation for why overfitting to the holdout 
is sometimes not a major problem even in the adaptive setting. If indeed every analyst only 
uses the holdout set to test if their latest submission is well above the previous best, then they 
effectively simulate our algorithm. 

A beautiful theoretical problem is to resolve the gap between our upper and lower bounds. On 
the practical side, it would be interesting to use the Ladder mechanism in a real competition. One 
interesting question is whether the Ladder mechanism actually encourages higher quality submissions 
by requiring a certain level of statistically significant improvement over previous submissions. 

A Kaggle reference mechanism 

As we did for the Ladder mechanism, we describe the algorithm as if the analyst was submitting 
classifiers $f\colon X \to Y$. In reality the analyst only submits a list of labels. It is easy to see that such 
a list of labels is sufficient to compute the empirical loss, which is all the algorithm needs to do. 
The input set S in the description of our algorithm corresponds to the set of data points (and 
corresponding labels) that Kaggle uses for the public leaderboard. 


Input: Data set S, rounding parameter $\alpha > 0$ (typically 0.00001) 

Algorithm: 

- For each round $t = 1, 2, \ldots, k$: 

1. Receive function $f_t\colon X \to Y$ 

2. Output $[R_S(f_t)]_\alpha$. 


Figure 7: Kaggle reference mechanism. We use the notation $[x]_\alpha$ to denote the number x rounded to the 
nearest integer multiple of $\alpha$. 
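
A minimal sketch of the mechanism in Figure 7 follows. As noted above, a submission is represented by its list of predicted labels. The class name, the default 0/1 loss and the toy labels are assumptions for illustration and do not describe Kaggle's actual implementation.

```python
def round_to_multiple(x, alpha):
    """[x]_alpha: x rounded to the nearest integer multiple of alpha.
    Python's round() resolves exact ties toward the even integer."""
    return round(x / alpha) * alpha


class KaggleMechanism:
    """Reports the empirical loss on the public holdout labels, rounded to alpha."""

    def __init__(self, labels, alpha=1e-5, loss=lambda pred, label: float(pred != label)):
        self.labels = list(labels)
        self.alpha = alpha
        self.loss = loss

    def submit(self, predictions):
        if len(predictions) != len(self.labels):
            raise ValueError("one prediction per holdout label is required")
        total = sum(self.loss(p, y) for p, y in zip(predictions, self.labels))
        empirical_loss = total / len(self.labels)
        return round_to_multiple(empirical_loss, self.alpha)


labels = [0, 1, 1, 0, 1, 0, 1]
predictions = [0, 1, 0, 0, 1, 1, 1]  # two mistakes out of seven
fine = KaggleMechanism(labels, alpha=1e-5).submit(predictions)
coarse = KaggleMechanism(labels, alpha=0.01).submit(predictions)
print(fine, coarse)
```

Every round is answered independently of all previous rounds, which is what distinguishes this reference mechanism from the Ladder.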